Robust Object Tracking with a Hierarchical Ensemble Framework


Mengmeng Wang, Yong Liu and Rong Xiong


Abstract: Autonomous robots are now widely used in
applications such as home security, entertainment, delivery,
navigation and guidance. These applications require robots to
track objects accurately in real time, which calls for tracking
algorithms with better robustness, speed and accuracy. In this paper,
we propose a real-time robust object tracking algorithm based 
on a hierarchical ensemble framework which incorporates 
information including individual pixel features, local patches 
and holistic target models. The framework combines multiple 
ensemble models simultaneously instead of using a single 
ensemble model individually. A discriminative model which 
accounts for the matching degree of local patches is adopted 
via a bottom ensemble layer, and a generative model which 
exploits holistic templates is used to search for the object based 
on the middle ensemble layer as well as an adaptive Kalman 
filter. We test the proposed tracker on challenging benchmark
image sequences. The experimental results demonstrate that
the proposed tracker performs favorably against several
state-of-the-art algorithms, especially when the appearance
changes dramatically or occlusions occur.

I. INTRODUCTION

Visual tracking is a well-studied problem in computer 
vision with a variety of applications such as surveillance, 
human motion analysis, robot guidance, human-computer 
interaction and so on. Recent attention has focused on
visual tracking in robotic domains [1], [2]. However,
due to diverse environments and the complex motion of
robots, tracking conditions such as occlusions,
deformations, fast motion and background clutter remain
difficult.

Fig. 1: An overview of the architecture of the layers.
1-combine the weak classifiers of each sub-patch to obtain
the corresponding base ensembles and weights in the bottom
layer; 2-combine the base ensembles to generate the
measurement of the object in the middle layer; 3-employ an
adaptive Kalman filter to increase the time consistency in the
top layer; 4-update the top layer; 5-re-extract the sub-patches
and update their weights in the middle layer; 6-update the
parameters of weak classifiers in the bottom layer.

There are three fundamental tracking components that are
essential [3] for improving tracking performance: (1) the
background information; (2) local appearance models; (3)
motion models. This paper presents a hierarchical tracking
framework which takes all three components into account.
We model the object as a three-layer ensemble structure
which incorporates individual pixel
features, local patches and the target bounding box.
The first component, i.e. the background information, is
essential to overcome background clutter caused by the
complexity of the environment. In our proposed method, we
incorporate both the object and the background information
into classifiers. For the second component, most existing
approaches [4], [5] represent the target with a limited
number of non-overlapping or regular local regions, so they
may not cope well with large deformations of the target.
In contrast, our hierarchical tracker models the target with a
series of overlapping, randomly sampled regions. We introduce
compressive sensing theory [6], [7], which significantly
reduces the dimension of the pixel features in local regions.
An overall schematic of the tracker is shown in Fig. 1. For
each sub-patch, we build a bottom ensemble layer which
combines a collection of weak classifiers on the compressive
features of the sub-patch into a strong classifier, the base
ensemble. In the middle ensemble layer, we aggregate these
base ensembles to generate the measurement of the target.
Since the robot moves almost all the time when tracking an
object, our approach also considers the third component
and introduces an adaptive Kalman filter [8] in the top layer
to account for the motion model and the temporal consistency
at the target bounding box level. The contributions
of our method are summarized as follows:

1) We organize compressive features, overlapping
sub-patches and holistic target models in a principled way
to capture the detailed appearance of the object;

2) We propose a hierarchical ensemble framework that
combines multiple ensemble models simultaneously
instead of using a single ensemble model individually;

3) We employ a compressive sensing method to significantly
reduce the feature dimensions so that our approach
can handle color images without suffering
from an explosion in memory use;

4) We take the motion model into consideration and use an
adaptive Kalman filter to overcome temporary occlusions
as well as missing and false detections.

In the experiments, we compare the proposed method
against state-of-the-art tracking approaches that are
feasible for robotic applications in terms of computational
complexity and hardware requirements, using an online object
tracking benchmark [3]. Our method obtains superior results
compared with these approaches. The
results also show that our method performs much better
than other approaches in moving human tracking under
conditions with occlusions, deformations, background
clutter and scale variations.

II. Related Work 

Recent tracking algorithms are developed in terms of 
three primary components: target representation, matching 
mechanism, and model update mechanism. 

Target representation plays a pivotal role in visual tracking,
and numerous representation schemes have been proposed.
Several factors need to be considered for an effective
appearance model. First, there are many choices of features to
represent the object, such as color histograms
[9], superpixels [10], Haar-like features [11]-[13], etc.
Second, the templates that represent the object can be global
or local. Global templates [1], [12] make it easy to construct
an object representation that contains information about the
whole object. However, for robot tracking problems,
holistic templates have difficulty handling significant
appearance changes and deformations of the target, whereas
local templates [4], [14], [15] are more robust and flexible
under these conditions. Still, maintaining the geometric
relationships among local patches remains difficult, since
environmental clutter, occlusions and partially similar objects
can often distract such local patches and lead to drift.

The matching mechanism is used to separate the candidate
regions that are most similar to the target from the background.
There are two main streams of research: one is the generative
model, which typically searches for the candidate most
similar to the target within a neighborhood [16]-[18]; the other
is the discriminative model, which poses the tracking problem
as a binary classification task that determines the decision
boundary separating the target from the background [12],
[13], [19], [20].

An online model update mechanism is essential for
robust visual tracking to deal with appearance variations.
To address this problem, Kalal et al. [15] develop a
bootstrapping classifier to select positive and negative samples
for model update. Grabner et al. [21] formulate the update
problem as a semi-supervised task where the classifier is
updated with both labeled and unlabeled data. However, online
boosting requires that the data be independent and
identically distributed. This is not always satisfied in visual
tracking because the data are often temporally correlated.

In the proposed method, we adopt compressive sensing
theory to reduce the dimension of Haar-like features, in
a manner similar to [12]. We employ a joint
representation which considers both global and local models
of the target to better handle significant appearance changes,
deformations, similar object identification and occlusions.
Our local models are efficiently constructed from a number of
overlapping, randomly sampled local patches, and we
re-extract the sub-patches at each time step to avoid the drift
caused by any individual sub-patch. We adopt a discriminative
model via the bottom ensemble layer to account for the
matching degree of local patches, and a generative model
searches for the object through the middle ensemble
layer together with an adaptive Kalman filter. For model update,
we employ ensemble learning to update the patches and
classifiers to capture appearance variations and reduce tracking
drift.

III. Robust Object Tracking with a Hierarchical
Ensemble Framework

In this section, we give a detailed description of the
proposed hierarchical ensemble tracking (HET) framework.
It is composed of two ensemble layers and a Kalman filter
layer. At each time step, we first detect several
samples around each local patch and build the
corresponding base ensemble for each sub-patch from several
weak classifiers in the bottom ensemble layer. Second, we
recover the target location in the middle ensemble layer by
combining these base ensembles, and regard this location
as the measurement for an adaptive Kalman filter. Third, we
determine the final object location in the current frame
from a motion model and the measurement via the adaptive
Kalman filter in the top layer. Finally, we update the model
by re-extracting the local overlapping image sub-patches
in the final target region with a random spatial
layout and updating the parameters of the weak classifiers for
tracking in the next frame.

A. Local Compressive Appearance Model 

Compressive sensing theory shows that if the dimension
of the feature space is sufficiently high, the features
can be projected through a random projection matrix onto a
randomly chosen low-dimensional space that preserves most of
the salient information of the original high-dimensional
features [22]. The signal can
be recovered as long as the projection matrix R satisfies
the Restricted Isometry Property (RIP) [7]. Representing the
object appearance by regions allows the proposed tracker to
better handle occlusions and large appearance changes. The
compressive appearance model also allows us to process a
large number of regions in real time.

In this paper, we build compressively sensed versions
of sub-patches. Sub-patches are extracted randomly,
and the relative locations between the sub-patches and the
target bounding box are established when the tracking window
is given by a detector or a manual label in the first frame.
Every sub-patch is represented by four components: a
compressive feature vector $\mathbf{g}^q$, a classification
score $c_q$, a relative location $\Delta\mathbf{p}_q$, where
$\Delta\mathbf{p}_q = [\Delta x_q, \Delta y_q]^\top$ denotes the
upper-left corner of the sub-patch relative to the upper-left
corner of the target window, and the location of the sub-patch
itself in the image space, $\mathbf{p}_q = [x_q, y_q]^\top$. The
$q$-th sub-patch $\lambda_q$ is denoted as:

$$\lambda_q = \{\mathbf{g}^q,\; c_q,\; \Delta\mathbf{p}_q,\; \mathbf{p}_q\} \tag{1}$$

Note that the width and the height of all
sub-patches are identical, denoted as $w$ and $h$, and are
determined at the beginning. After extracting the $Q$ local
overlapping image sub-patches $\Lambda = \{\lambda_1, \lambda_2,
\ldots, \lambda_Q\}$, where $Q$ denotes the number of
sub-patches, we sample for the $q$-th sub-patch
$N$ candidate sub-patches of the same size
whose Euclidean distances to the sub-patch are smaller than a
threshold $\beta$ that is fixed throughout the sequence. These
samples form a matrix $\mathbf{S}^q$. We then
denote all samples as $\mathbf{S} = [\mathbf{S}^1, \mathbf{S}^2,
\ldots, \mathbf{S}^Q]$.
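The sampling step above can be illustrated with a minimal Python sketch. It assumes integer pixel offsets for the upper-left corner and ignores image borders; the names `sample_candidates` and `rng` are hypothetical and do not come from the paper.

```python
import random


def sample_candidates(x, y, beta, n_samples, rng=None):
    """Draw up to n_samples distinct upper-left corners whose Euclidean
    distance to the sub-patch corner (x, y) is smaller than beta."""
    rng = rng or random.Random(0)
    r = int(beta)
    # All integer offsets strictly inside the circle of radius beta.
    offsets = [(dx, dy)
               for dx in range(-r, r + 1)
               for dy in range(-r, r + 1)
               if dx * dx + dy * dy < beta * beta]
    chosen = rng.sample(offsets, min(n_samples, len(offsets)))
    return [(x + dx, y + dy) for dx, dy in chosen]
```

All candidates share the width $w$ and height $h$ of the sub-patch, so only the corner positions need to be stored.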

To obtain features that are invariant to
scale, we adopt a multiscale image representation, which is
usually formed by convolving the input image with Gaussian
filters of different spatial variances. For computational
efficiency, we replace the Gaussian filters with rectangle
filters [12] and speed up the process with the integral image
method. For the $N$ samples $\mathbf{S}^q$ of the $q$-th
sub-patch, we obtain the feature matrix $\mathbf{H}^q =
[\mathbf{h}^q_1, \mathbf{h}^q_2, \ldots, \mathbf{h}^q_N] \in
\mathbb{R}^{n \times N}$, whose $k$-th column $\mathbf{h}^q_k \in
\mathbb{R}^n$ is the large multiscale
feature vector of the $k$-th sample, obtained by filtering the
sample with the rectangle filters and concatenating the
responses; its dimension $n$ depends on the sub-patch size
$w \times h$. The features of all $N \times Q$ samples are
denoted as $\mathbf{H} = [\mathbf{H}^1, \mathbf{H}^2, \ldots,
\mathbf{H}^Q]$.
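The integral image trick mentioned above makes every rectangle-filter response a constant-time lookup. The following sketch shows the idea for a grayscale image stored as a list of rows; the function names are illustrative and not taken from the paper.

```python
def integral_image(img):
    """Return the (H+1) x (W+1) summed-area table of img (list of rows)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            # Sum of all pixels above and to the left, inclusive.
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with upper-left corner (x, y),
    width w and height h, using four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

Once the table is built for a frame, the cost of each rectangle feature is independent of the rectangle size.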

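We adopt a sparse random matrix $\mathbf{R} \in
\mathbb{R}^{m \times n}$ to project the original
$n$-dimensional feature space onto a lower-dimensional
space of dimension $m$, i.e. $\mathbf{L}^q = \mathbf{R}\mathbf{H}^q
\in \mathbb{R}^{m \times N}$ for the $q$-th sub-patch.
Concatenating the $Q$ local patches, we obtain $\mathbf{L} =
[\mathbf{L}^1, \mathbf{L}^2, \ldots, \mathbf{L}^Q] \in
\mathbb{R}^{m \times NQ}$, computed by

$$\mathbf{L} = \mathbf{R}\mathbf{H} \tag{2}$$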


A typical choice of such a measurement matrix is the
random Gaussian matrix, $R_{ij} \sim N(0, 1)$. But when $n$ is
large, the computational load is still heavy because the random
Gaussian matrix is dense. Thus it is common to employ
a very sparse random measurement matrix that satisfies a
weaker property than RIP but is almost as accurate as the
conventional random Gaussian matrix [23], as in (3), where
$R_{ij}$ denotes the element in the $i$-th row and $j$-th column
of $\mathbf{R}$:

$$R_{ij} = \sqrt{\rho} \times \begin{cases}
 1 & \text{with probability } \frac{1}{2\rho} \\
 0 & \text{with probability } 1 - \frac{1}{\rho} \\
 -1 & \text{with probability } \frac{1}{2\rho}
\end{cases} \tag{3}$$

This random matrix is fixed at the beginning and is cheap
to compute for real-time tracking because the maximum
number $Z$ of nonzero elements is kept small. The
scheme used to produce the random matrix in this work is similar
to [12]. We illustrate the dimension reduction process in
Fig. 2.

Fig. 2: An illustration of the compressive representation of an
arbitrary sample. Denote its high-dimensional feature vector
as $\mathbf{h} \in \mathbb{R}^n$. After the dimension reduction
from $n$ to $m$, we get its $m$-dimensional feature vector
$\mathbf{l} = [l_1, l_2, \ldots, l_m]^\top \in \mathbb{R}^m$. Each
element of $\mathbf{l}$ is a linear combination of the feature
values of fewer than $Z$ rectangles (yellow) inside the sample
region (red), and the coefficients of the combination are given
by the rows of $\mathbf{R}$. The
feature value of each rectangle is the response
of the corresponding rectangle filter, which has the same size
as the rectangle itself, i.e., the sum of the gray values of all
pixels inside it, which can be computed very fast using the
integral image.
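To make the very sparse projection concrete, the sketch below draws a matrix whose entries are $\pm\sqrt{\rho}$ with probability $1/(2\rho)$ each and zero otherwise, and applies it to one feature vector. It is only an illustration: it does not enforce the cap $Z$ on nonzero entries, it is not the exact generation scheme of [12], and the names `sparse_random_matrix` and `project` are hypothetical.

```python
import math
import random


def sparse_random_matrix(m, n, rho, rng=None):
    """Return m sparse rows; each row maps column index -> nonzero value."""
    rng = rng or random.Random(0)
    scale = math.sqrt(rho)
    rows = []
    for _ in range(m):
        row = {}
        for j in range(n):
            u = rng.random()
            if u < 1.0 / (2.0 * rho):      # +sqrt(rho) with prob 1/(2 rho)
                row[j] = scale
            elif u < 1.0 / rho:            # -sqrt(rho) with prob 1/(2 rho)
                row[j] = -scale
            # otherwise the entry is zero and is simply not stored
        rows.append(row)
    return rows


def project(rows, h):
    """Compute l = R h using only the stored nonzero entries."""
    return [sum(v * h[j] for j, v in row.items()) for row in rows]
```

Storing only the nonzero entries keeps both memory and projection cost proportional to the number of nonzeros rather than to $m \times n$.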


B. Classification via Ensemble Layers 


To link the individual pixels with the local patches,
we employ the naive Bayes classifier to construct a
pool of weak classifiers, one for each individual
compressive feature, in the bottom layer. We assume that the
$m$ compressive features of each sub-patch are
independently distributed and build $m$ weak classifiers
for these features by considering both the object
and the background information. Since $\mathbf{R}$ is fixed
during the tracking process, the high-dimensional
features of the samples are compressed consistently for all
sub-patches. Let $\mathbf{l} = [l_1, l_2, \ldots, l_m]^\top \in
\mathbb{R}^m$ denote an arbitrary compressive
sample. For the $i$-th compressive feature, the $i$-th classifier
is constructed as follows:


$$f_i(l_i) = \log\left(\frac{p(y = 1 \mid l_i)}{p(y = 0 \mid l_i)}\right)
= \log\left(\frac{p(l_i \mid y = 1)\, p(y = 1)}{p(l_i \mid y = 0)\, p(y = 0)}\right), \tag{4}$$


where $y \in \{0, 1\}$ is a binary variable which represents the
sample label. We assume $p(y = 1) = p(y = 0)$ by sampling
the same number of positive and negative samples at the update
step. The conditional distributions $p(l_i \mid y)$ are
approximately Gaussian due to the random projection of the
high-dimensional features [24]. Thus we have:

$$p(l_i \mid y = 1) \sim N(\mu_i^1, \sigma_i^1), \quad p(l_i \mid y = 0) \sim N(\mu_i^0, \sigma_i^0), \tag{5}$$

where $\mu_i^1$ ($\mu_i^0$) and $\sigma_i^1$ ($\sigma_i^0$) are the
mean and standard deviation of the positive (negative) class.

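Then we introduce an ensemble strategy which combines
the outputs of the weak classifiers to create a strong classifier,
the base ensemble, which is used to detect the sub-patches as
shown in Fig. 3. It is denoted as

$$F(\mathbf{l}) = \sum_{i=1}^{m} f_i(l_i). \tag{6}$$

Fig. 3: The ensemble process in the bottom layer for an
arbitrary sample $\mathbf{l} = [l_1, l_2, \ldots, l_m]^\top$, whose
weak classifier outputs are combined into the strong classifier
$F(\mathbf{l})$.

For the $q$-th sub-patch, we search its $N$ samples for the best
match and obtain its matching score $c_q$:

$$\mathbf{g}^q = \arg\max_{\mathbf{l}^q_k} F(\mathbf{l}^q_k), \quad k = 1, 2, \ldots, N, \qquad
c_q = F(\mathbf{g}^q). \tag{7}$$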
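The bottom-layer classification can be sketched in a few lines of Python. The sketch assumes equal class priors and Gaussian class-conditional densities, as stated in the text; the function names and the `(mu, sigma)` tuple layout are illustrative choices rather than the authors' code.

```python
import math


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma * sigma)
            - (x - mu) ** 2 / (2.0 * sigma * sigma))


def weak_score(l_i, pos, neg):
    """f_i(l_i): log-likelihood ratio of one compressive feature.
    pos and neg are (mu, sigma) pairs; equal priors cancel out."""
    return gaussian_log_pdf(l_i, *pos) - gaussian_log_pdf(l_i, *neg)


def strong_score(l, pos_params, neg_params):
    """F(l): sum of the m weak classifier outputs."""
    return sum(weak_score(li, p, n)
               for li, p, n in zip(l, pos_params, neg_params))


def best_match(samples, pos_params, neg_params):
    """Return the sample with the highest strong score and that score."""
    scores = [strong_score(l, pos_params, neg_params) for l in samples]
    k = max(range(len(samples)), key=scores.__getitem__)
    return samples[k], scores[k]
```

Working in the log domain turns the product of likelihood ratios into a sum, which is why the base ensemble reduces to a plain summation over the weak classifiers.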

We match all $Q$ sub-patches in the same way in the
bottom layer and obtain the compressive features of their
optimal matches, $\mathbf{G} = [\mathbf{g}^1, \mathbf{g}^2, \ldots,
\mathbf{g}^Q] \in \mathbb{R}^{m \times Q}$, and their
scores $\mathbf{c} = [c_1, c_2, \ldots, c_Q]^\top$. In the ensemble
learning field, it is often found that performance improves when
multiple models are combined, as in (6), instead
of using a single model individually [25].

In the middle layer, we propose a novel ensemble strategy
to acquire the observed location of the object from the base
ensembles, as in Fig. 4, via these $Q$ detected local patches.

Suppose the actual location of the object we are trying
to predict is given by $\mathbf{h}(\lambda)$, and
$\mathbf{y}_i(\lambda) = \Delta\mathbf{p}_i + \mathbf{p}_i$ denotes
the $i$-th hypothesis of the object location obtained from the
$i$-th detected sub-patch. The output of each sub-patch model can
be written as the true value plus an error:

$$\mathbf{y}_i(\lambda) = \mathbf{h}(\lambda) + \boldsymbol{\epsilon}_i(\lambda) \tag{8}$$


For ease of comparison, we apply zero-mean normalization to
the scores of the sub-patches, $\mathbf{c} = [c_1, c_2, \ldots,
c_Q]^\top$, and then rescale them to $\boldsymbol{\omega} =
[\omega_1, \omega_2, \ldots, \omega_Q]^\top$, $\omega_i \in
[0.1, 0.9]$. $\boldsymbol{\omega}$ is regarded as the weights of
the candidates obtained by the corresponding sub-patches. We
update these weights adaptively for each new frame. The
combined prediction is given by

$$\mathbf{y}_{COM} = \frac{1}{W} \sum_{i=1}^{Q} \omega_i \mathbf{y}_i, \qquad W = \omega_1 + \cdots + \omega_Q \tag{9}$$
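The weighting and combination step can be sketched as follows. The text only states that the scores are zero-mean normalized and rescaled to $[0.1, 0.9]$; the sketch assumes a z-score followed by a linear min-max mapping, which is one plausible reading and not necessarily the authors' exact procedure. The names `score_weights` and `combine` are hypothetical.

```python
def score_weights(scores, lo=0.1, hi=0.9):
    """Map raw sub-patch scores to weights in [lo, hi]."""
    q = len(scores)
    mean = sum(scores) / q
    std = (sum((c - mean) ** 2 for c in scores) / q) ** 0.5
    if std == 0:
        # All scores equal: give every sub-patch the same mid-range weight.
        return [(lo + hi) / 2.0] * q
    z = [(c - mean) / std for c in scores]          # zero-mean normalization
    z_min, z_max = min(z), max(z)
    return [lo + (hi - lo) * (v - z_min) / (z_max - z_min) for v in z]


def combine(hypotheses, weights):
    """Weighted mean of the (x, y) location hypotheses, as in (9)."""
    total = sum(weights)
    x = sum(w * hx for w, (hx, _) in zip(weights, hypotheses)) / total
    y = sum(w * hy for w, (_, hy) in zip(weights, hypotheses)) / total
    return x, y
```

Keeping the weights away from 0 and 1 means that no sub-patch is ignored entirely and no single sub-patch dominates the combined location.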

The average sum-of-squares error then takes the form

$$E_\lambda\left[(\mathbf{y}_i(\lambda) - \mathbf{h}(\lambda))^2\right]
= E_\lambda\left[\boldsymbol{\epsilon}_i(\lambda)^2\right], \tag{10}$$

where $E_\lambda[\cdot]$ denotes a frequentist expectation. The
average error made by the sub-patch models acting individually is

$$E_{AV} = \frac{1}{Q} \sum_{i=1}^{Q} E_\lambda\left[\boldsymbol{\epsilon}_i(\lambda)^2\right]. \tag{11}$$

Fig. 4: The first column shows the object in the previous frame
and the second column the randomly extracted sub-patches.
The sub-patches are then passed to the bottom ensemble layer
to obtain the base ensembles. The fourth column shows the
scores and the corresponding target candidates of the local
patches, which are the outputs of the base ensembles. Finally,
we employ the proposed ensemble strategy to obtain the
observation of the object in the middle layer.


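We assume that the errors have zero mean and are uncorrelated
because the sub-patches are randomly extracted. So we have:

$$E_\lambda[\boldsymbol{\epsilon}_i(\lambda)] = 0, \qquad
E_\lambda[\boldsymbol{\epsilon}_i(\lambda)\,\boldsymbol{\epsilon}_j(\lambda)] = 0, \quad i \neq j \tag{12}$$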

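The expected error of the combined prediction is
computed by

$$\begin{aligned}
E_{COM} &= E_\lambda\left[\left(\frac{1}{W}\sum_{i=1}^{Q} \omega_i \boldsymbol{\epsilon}_i(\lambda)\right)^2\right] \\
&= E_\lambda\left[\frac{1}{W^2}\sum_{i=1}^{Q} \omega_i \omega_i \boldsymbol{\epsilon}_i(\lambda)^2\right] \\
&\le \frac{1}{W} E_\lambda\left[\frac{1}{W}\sum_{i=1}^{Q} \omega_i \boldsymbol{\epsilon}_i(\lambda)^2\right] \\
&\approx \frac{1}{W} E_{AV}
\end{aligned} \tag{13}$$

Here the second line uses (12), the inequality uses
$\omega_i^2 \le \omega_i$ for $\omega_i \in [0.1, 0.9]$, and the last
step approximates the weighted mean of the individual errors
by $E_{AV}$.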


We extract more than 10 sub-patches to ensure $W > 1$, since
each $\omega_i \ge 0.1$.
The result suggests that the average error of an object model
can be reduced by a weighted combination of all the sub-patch
models, under the key assumption (12) that the errors of the
individual models are uncorrelated, which we encourage by
choosing the sub-patches randomly.
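A small numerical check can make the error argument tangible. Under the zero-mean, uncorrelated-error assumption (12), the cross terms vanish and the combined error equals $\frac{1}{W^2}\sum_i \omega_i^2 E_\lambda[\epsilon_i^2]$. The sketch below evaluates this expression next to $E_{AV}$ for given weights and per-model error variances; the function name and the example numbers are illustrative only.

```python
def expected_errors(weights, variances):
    """Return (E_AV, E_COM) for zero-mean, uncorrelated model errors.

    weights   -- the omega_i of the sub-patch models
    variances -- E[eps_i^2] of each sub-patch model
    """
    total = sum(weights)
    e_av = sum(variances) / len(variances)
    # Cross terms E[eps_i eps_j] vanish for i != j under assumption (12).
    e_com = sum(w * w * v for w, v in zip(weights, variances)) / (total * total)
    return e_av, e_com


# Example: 11 sub-patches with weights spread evenly over [0.1, 0.9]
# and identical error variance 4.0 for every sub-patch model.
example_weights = [0.1 + 0.08 * i for i in range(11)]
example_e_av, example_e_com = expected_errors(example_weights, [4.0] * 11)
```

In this example the weights sum to 5.5, so the combined error is well below the average individual error, in line with the $W > 1$ condition discussed above.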


C. Adaptive Kalman Filter 

The top layer builds an adaptive Kalman filter on top of
the two ensemble layers to estimate the optimal system
state and the target image velocity, so that the proposed tracker
can overcome temporary occlusions as well as missing and false
detections. We regard the observation result from the
bottom and middle ensemble layers as the measurement.
The discrete-time system state and measurement at time $k$
are given by $\mathbf{x}(k) = [x(k), y(k), v_x(k), v_y(k)]^\top$ and
$\mathbf{z}(k) = \mathbf{y}_{COM} = [x_o(k), y_o(k)]^\top$, where
$x(k), y(k)$ and $x_o(k), y_o(k)$ denote
the center coordinates in the image space corresponding
to the system state and the measurement at time $k$, respectively,
and $v_x(k), v_y(k)$ denote the velocities of the system state
along the two axes. The state and measurement at the next time
step $k + 1$ are given by


$$\begin{aligned}
\mathbf{x}(k+1 \mid k) &= \mathbf{A}(k+1 \mid k)\, \mathbf{x}(k \mid k) + \boldsymbol{\delta}(k+1), \\
\mathbf{z}(k+1) &= \mathbf{B}\, \mathbf{x}(k+1 \mid k) + \mathbf{v}(k+1),
\end{aligned}$$

$$\mathbf{A}(k+1 \mid k) = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
\mathbf{B} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \tag{14}$$

where $\mathbf{A}(k+1 \mid k)$ is modeled according to Newton's
equations of motion, $\Delta t$ is the time between two frames, and
$\boldsymbol{\delta}(k+1)$ and $\mathbf{v}(k+1)$ are assumed to be
white Gaussian noises with zero mean and covariance matrices
$\mathbf{Q}(k)$ and $\mathbf{R}(k)$,
respectively. To achieve an adaptive Kalman filter, we use
the mean of the normalized scores in the middle layer to update
these two covariance matrices every frame, as in (15) and (16).
We determine the final object location in the current frame
in the top layer with this adaptive Kalman filter.

$$\mathbf{Q}(k+1) = \Big(\cdots \sum_{i=1} \cdots\Big)\, \mathbf{Q}(0) \tag{15}$$

$$\mathbf{R}(k+1) = \Big(\cdots \sum_{i=1} \cdots\Big)\, \mathbf{R}(0) \tag{16}$$


D. Model Update


It is important to update the target model continuously for
robust tracking in difficult environments.
The proposed method updates the hierarchical model via
three mechanisms: re-extracting the sub-patches according to
the object found in the current frame, choosing
the sub-patches that need to be updated, and adjusting the
parameters of the weak classifiers in the bottom layer. The
update process is also shown in Fig. 1.

Once we find the object in the current frame, we need
to correct the locations of all sub-patches in the middle
layer because of the drift of the detection process. The way
the random overlapping sub-patches are re-extracted is fixed
at the first frame. After that, we compress the features of
these new sub-patches and feed them to the weak classifiers
in the bottom layer to obtain the updated scores $\mathbf{c} =
[c_1, c_2, \ldots, c_Q]^\top$. We assume that the scores are
Gaussian distributed, and
the sub-patches to be updated are those whose scores satisfy

$$c_j \in (\mu_c - \sigma_c,\; \mu_c + \sigma_c), \quad j = 1, 2, \ldots, Q, \tag{17}$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation
of the scores.

Then, for the $j$-th chosen sub-patch, we extract $N$ positive
samples whose Euclidean distances to the sub-patch are
smaller than a threshold value $\beta$ and $N$ negative samples
whose Euclidean distances to the sub-patch are larger than
a second threshold value that is fixed at the beginning. We
update the parameters of its $i$-th weak classifier as

$$\mu_i^1 \leftarrow \lambda\, \mu_i^1 + (1 - \lambda)\, \mu^1, \tag{18}$$

where $\lambda$ is the learning rate and $\mu^1$ and $\sigma^1$
denote the mean and standard deviation of the $N$
positive samples. $\sigma_i^1$, $\mu_i^0$ and $\sigma_i^0$ are
updated in a similar way.
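The model update can be sketched in Python as follows. The selection rule follows (17) and the mean update follows (18). The standard-deviation update is an assumption: the sketch borrows the running-variance blend used in compressive tracking [12], which may differ from the rule the authors actually use. All function names are hypothetical.

```python
import math


def select_for_update(scores):
    """Indices j whose score lies in (mu_c - sigma_c, mu_c + sigma_c)."""
    q = len(scores)
    mu_c = sum(scores) / q
    sigma_c = math.sqrt(sum((c - mu_c) ** 2 for c in scores) / q)
    return [j for j, c in enumerate(scores)
            if mu_c - sigma_c < c < mu_c + sigma_c]


def update_gaussian(mu_old, sigma_old, samples, lam=0.85):
    """Blend the stored (mu, sigma) of one weak classifier with the
    statistics of the newly extracted samples of one class."""
    n = len(samples)
    mu_new = sum(samples) / n
    var_new = sum((s - mu_new) ** 2 for s in samples) / n
    mu = lam * mu_old + (1.0 - lam) * mu_new           # mean update, as in (18)
    # Assumed variance blend, taken from compressive tracking [12].
    var = (lam * sigma_old ** 2 + (1.0 - lam) * var_new
           + lam * (1.0 - lam) * (mu_old - mu_new) ** 2)
    return mu, math.sqrt(var)
```

Calling `update_gaussian` once per compressive feature and per class (positive and negative) refreshes all weak classifiers of a selected sub-patch.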

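For the top layer of Section III-C, a constant-velocity Kalman filter with state $[x, y, v_x, v_y]^\top$ and a position-only measurement can be sketched in pure Python. The sketch uses the textbook predict and update equations; the score-dependent scaling of the noise covariances is not reproduced, so the caller passes whatever `Q` and `R` apply to the current frame. Helper names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b, sign=1.0):
    """Return a + sign * b element-wise."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inv2(m):
    """Invert a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def kalman_step(x, P, z, Q, R, dt=1.0):
    """One predict/update cycle.
    x: 4x1 state, P: 4x4 covariance, z: 2x1 measured center."""
    A = [[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]]
    B = [[1, 0, 0, 0], [0, 1, 0, 0]]
    I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    # Predict with the constant-velocity motion model.
    x_pred = matmul(A, x)
    P_pred = add(matmul(matmul(A, P), transpose(A)), Q)
    # Update with the measurement from the middle layer.
    S = add(matmul(matmul(B, P_pred), transpose(B)), R)
    K = matmul(matmul(P_pred, transpose(B)), inv2(S))
    innovation = add(z, matmul(B, x_pred), -1.0)
    x_new = add(x_pred, matmul(K, innovation))
    P_new = matmul(add(I4, matmul(K, B), -1.0), P_pred)
    return x_new, P_new
```

A larger `R` makes the filter trust the motion model more, which is the behavior one would want when the sub-patch scores indicate an unreliable measurement, for example during an occlusion.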

IV. Experiments 

In this section, we present the experimental results of our
method. First, we describe the implementation details of the
proposed tracker and the evaluation criteria used to
quantitatively assess performance. Second, we validate the joint
representation of our hierarchical ensemble framework against
the base method. Third, we compare our tracker with the three
most similar methods, which are well known in the visual tracking
field. Fourth, we compare our method with 8 state-of-the-art
methods which are feasible for robotic applications in terms
of computational complexity and hardware requirements.
Finally, we demonstrate that our tracker performs very well
in moving human tracking, which is crucial for the tracking
applications of robots.

A. Implementation Details 

The proposed algorithm is implemented in MATLAB
(R2013a) and runs at 30 frames per second on an Intel
i7-4790 machine with a 3.6 GHz CPU and 8 GB RAM. For
each sequence, the location of the target object is manually
labeled in the first frame. For all reported experiments, we
employ 150 weak classifiers in the bottom ensemble layer
and randomly generate 11 sub-patches that are located inside
the object and whose width and height are three quarters of
those of the object. We set the learning rate $\lambda = 0.85$, the
maximum number of nonzero elements $Z = 4$ in the random matrix
$\mathbf{R}$, and the thresholds $\beta = 20$, 71=213 in all
experiments.

In the experiments, we employ two evaluation criteria to
quantitatively assess the performance of the trackers:
the average overlap rate and the center location error.
Given the tracked bounding box $ROI_T$ and the ground truth
bounding box $ROI_G$, we use the detection criterion of the
PASCAL VOC challenge [26], $score = \frac{area(ROI_T \cap
ROI_G)}{area(ROI_T \cup ROI_G)}$, to
evaluate the success rate.
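Both evaluation criteria are simple to compute. The sketch below assumes axis-aligned boxes given as `(x, y, w, h)` with `(x, y)` the upper-left corner; this box layout and the function names are assumptions made for the example.

```python
import math


def overlap_score(box_t, box_g):
    """PASCAL VOC overlap: intersection area divided by union area."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0


def center_error(box_t, box_g):
    """Euclidean distance in pixels between the two box centers."""
    cxt, cyt = box_t[0] + box_t[2] / 2.0, box_t[1] + box_t[3] / 2.0
    cxg, cyg = box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0
    return math.hypot(cxt - cxg, cyt - cyg)
```

A success plot is then obtained by counting, for each overlap threshold, the fraction of frames whose `overlap_score` exceeds it, and a precision plot by doing the same for `center_error` below a pixel threshold.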

B. Comparison with the base method 

Compressive tracking (CT) [12] employs compressive
sensing theory to compress the appearance model. It is
reasonable to consider CT as our base method since the way
an image sub-patch is compressed is almost the same. In the
bottom layer of our method, we build compressively sensed
versions of sub-patches, while CT represents objects with a
global compressive appearance model.

However, it is insufficient to represent the whole object by
a single appearance model as CT does, especially when
tracking non-rigid objects. So we adopt the joint
representation which considers both global and local models
of the target to better handle significant appearance changes,
deformations and occlusions. As shown in Fig. 5 and Fig. 7,
our method tracks more accurately
than the base method, and it outperforms CT by 24% in
the success plots and by 38.3% in the precision plots.

Fig. 5: Pixel center location error of our method and the
base method CT at each frame on four video sequences.
Our method tracks the objects more accurately than CT on
the four videos.

Fig. 6: Pixel center location error of our method and the
three similar methods at each frame on four video sequences.
Our method tracks the objects more robustly than the three
methods on these videos.

C. Comparison with similar methods 

Three methods, LSK [14], OAB [20] and MIL
[19], are the most similar to our tracker among recent work. The
proposed method outperforms them, as shown in Fig. 6 and
Fig. 7.

LSK is a robust tracking algorithm with a local
sparse appearance model which combines a static sparse
dictionary with a sparse coding histogram. This method
outperforms several sparse representation methods according
to [3]. However, LSK neglects the temporal consistency at
the target bounding box level, while we take it into
consideration by employing an adaptive Kalman filter. Therefore
our method is more robust to occlusions than LSK, as shown
in Fig. 6.

OAB and MIL are both boosting-based algorithms similar
to ours. Our ensemble technique is much simpler than the
boosting used by the two methods. However, they characterize
the objects by global templates, while we adopt both local
representations and holistic templates. Thus we can better
handle deformations and occlusions, as shown in Fig. 6.

D. Comparison with State-of-the-arts 

For comparison, we run 11 state-of-the-art algorithms with
the same initial target positions. These algorithms are
CT [12], CN [27], Struck [11], TLD [15], ASLA [16],
CSK [18], OAB [20], MIL [19], LSK [14], SCM [28] and
VTD [17]. For a fair evaluation, we evaluate the proposed
HET against those methods using the source code provided
by the authors with adjusted parameters. We examine the
effectiveness of the proposed approach on an online object
tracking benchmark [3] with 50 sequences that cover
the most challenging tracking scenarios, such as illumination
variations, scale variations, occlusions, deformations, etc.

Fig. 7: Precision and success rate plots of overall
performance comparison for the 50 videos in the benchmark.

Note that we do not compare with some excellent methods
such as MEEM [29], TGPR [30] and MUSTer [31], because
they are not fast enough for robotic applications. For instance,
the average time cost of MUSTer on the benchmark [3]
is 0.287 s/frame on a cluster node (3.4 GHz, 8 cores, 32 GB
RAM). We also do not consider convolutional neural
network based methods such as FCNT [32] and HCF [33], because
they require powerful GPUs and still run
at a low frame rate. Although the performance of these
trackers may be a little better than ours, their computational
costs and hardware requirements are impractical for robots.

The precision plots and success plots
of OPE on the benchmark are shown in Fig. 7. Compared with the
other top-ranking trackers in the benchmark, the results show
that the proposed method achieves the best average performance.
The performance of our approach can be attributed to the
efficient ensemble methods on sub-patches with a spatial
layout, combined with the adaptive Kalman filter.

Fig. 8: Tracking results of 12 trackers (denoted in different colors and lines) on 6 image sequences. Frame indexes are
shown in the top left of each figure in yellow color. Results are best viewed on high-resolution displays.

E. Human body tracking for robots 

In particular, we find that the proposed HET performs
very well in moving human tracking, which is crucial for
the tracking applications of robots. We choose 20 videos
whose targets are moving human bodies and which include
challenging conditions such as occlusions, deformations, fast
motion, background clutter, etc. Due to space limitations,
we only show some frames from 6 of the 20 videos in
Fig. 8. We can see that when the human bodies undergo
large deformations or occlusions, almost all other methods lose
the objects, whereas the proposed HET does not.

Due to the supple structure of human limbs and the
flexibility of human movements, the most challenging factors
in moving body tracking are deformations and occlusions.
Splitting a human object into local
parts makes the appearance model more flexible, and the
local compressive appearance models of the proposed HET
do exactly this. We build compressively sensed
versions of local patches in the bottom layer, which allows
the proposed tracker to better handle occlusions and large
appearance changes. The tracking results on these moving
human videos with the occlusion (OCC), deformation (DEF),
background clutter (BC), scale variation (SV), fast motion (FM)
and illumination variation (IV) attributes, based on the
precision and success rate metrics, are persuasive, as shown in
Fig. 9. Our method ranks first on almost all of these attributes
according to the two criteria.

The robustness and real-time performance in human
body tracking make HET suitable for many robotic
applications such as human-computer interaction, home service
robots, robot teaching systems and unmanned vehicles.


V. Conclusion 

In this paper, we propose a novel hierarchical ensemble
framework in which the representations of the target candidates
are localized and compressed. We incorporate information
from individual pixel features, local patches and holistic
target models. The multiple ensemble layers exploit the
intrinsic relationships not only between the individual pixel
features and the local patches, but also between the patches
and the target candidates. In the bottom layer, the base
ensembles are created as linear combinations of the outputs
of the weak classifiers. A diverse collection of base
ensembles is systematically combined to generate
a stronger ensemble classifier in the middle layer, and
the scores of the local patches are normalized to produce a
vector of weights for the base ensembles. Experimental results
with evaluations against several state-of-the-art methods on
challenging image sequences demonstrate the robustness of
the proposed HET tracking algorithm. Since our method is
real-time, general and robust, we plan to apply it to the
tracking tasks of robots. In particular, because the proposed HET
is very effective for moving human tracking, it can be applied
to many applications such as unmanned vehicles and robot
teaching systems.